We consider a problem of recovering a high-dimensional vector $\mu$
observed in white noise, where the unknown vector $\mu$ is assumed to
be sparse. The objective of the paper is to develop a Bayesian formalism
which gives rise to a family of $\ell_0$-type penalties. The penalties
are associated with various choices of the prior distributions $\pi_n(\cdot)$
on the number of nonzero entries of $\mu$ and, hence, are easy to
interpret. The resulting Bayesian estimators lead to a general
thresholding rule which accommodates many of the known thresholding
and model selection procedures as particular cases corresponding to
specific choices of $\pi_n(\cdot)$. Furthermore, they achieve optimality in a
rather general setting under very mild conditions on the prior. We
also specify the class of priors $\pi_n(\cdot)$ for which the resulting
estimator is adaptively optimal (in the minimax sense) for a wide range of
sparse sequences and consider several examples of such priors.

1. Introduction. Consider a problem of estimation of a high-dimensional
multivariate Gaussian mean with independent terms and common variance,

(1.1) $y_i = \mu_i + \sigma z_i, \quad z_i \sim N(0,1), \quad i = 1,\ldots,n.$

The variance $\sigma^2$ is assumed to be known and the goal is to estimate the
unknown mean vector $\mu$ from a set $\Theta_n \subset \mathbb{R}^n$. This is a well-studied problem
that arises in various statistical settings, for example, model selection or
orthonormal regression.

Some extra assumptions are usually placed on $\Theta_n$. We assume that the
vector $\mu$ is sparse, that is, most of its entries are zeroes or "negligible" and
only a small fraction is "significantly large." The indices of the large entries
are, however, not known in advance. Formally, the sparsity assumption can
be quantified in terms of so-called nearly-black objects (Donoho et al. [12])
or strong and weak $\ell_p$-balls (Johnstone [20], Donoho and Johnstone [10, 11])
discussed below. In sparse cases the natural estimation strategy is
thresholding.

It is well known that various thresholding rules can be considered as
penalized likelihood estimators minimizing

(1.2) $\|y-\mu\|_2^2 + P(\mu)$

for the corresponding penalties $P(\mu)$. The traditional $\ell_2$ penalty $P(\mu) =
\lambda_n^2\|\mu\|_2^2$ does not lead to a thresholding but to a linear shrinkage estimator
$\hat\mu_i^* = \frac{1}{1+\lambda_n^2}\,y_i$. The $\ell_1$ penalty produces a "shrink-or-kill" soft thresholding,
where $\hat\mu_i^* = \mathrm{sign}(y_i)(|y_i| - \lambda_n/2)_+$, which coincides with the LASSO
estimator of Tibshirani [25]. The general $\ell_p$, $p > 0$, penalty yields bridge regression
(Frank and Friedman [16]) and results in thresholding when $p \le 1$. Wider
classes of penalties leading to various thresholding rules are discussed in
Antoniadis and Fan [6], Fan and Li [13] and Hunter and Li [19].
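
As a minimal sketch of the two coordinatewise rules just described, the snippet below assumes the penalties $\lambda^2\|\mu\|_2^2$ and $\lambda\|\mu\|_1$ in (1.2). The function names are illustrative and do not come from the paper.

```python
import math

def linear_shrinkage(y, lam):
    """Coordinatewise minimizer of (y_i - m)^2 + lam^2 * m^2: scale by 1/(1 + lam^2)."""
    return [yi / (1.0 + lam ** 2) for yi in y]

def soft_threshold(y, lam):
    """Coordinatewise minimizer of (y_i - m)^2 + lam * |m|: shrink by lam/2 or set to zero."""
    return [math.copysign(max(abs(yi) - lam / 2.0, 0.0), yi) for yi in y]
```

Note that the shrinkage rule never produces exact zeros, whereas the soft rule zeroes every coordinate with $|y_i| \le \lambda/2$.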

All the penalties mentioned above are related to magnitudes of $\mu_i$. In this
paper we consider the $\ell_0$, or complexity type penalties, where the penalty
is placed on the number of nonzero $\mu_i$. The $\ell_0$ quasi-norm of a vector $\mu$ is
defined as the number of its nonzero entries, that is, $\|\mu\|_0 = \#\{i : \mu_i \ne 0\}$. In
the simplest case, the complexity penalty $P(\mu) = \lambda_n^2\|\mu\|_0$ and minimization
of (1.2) obviously result in minimizing

(1.3) $\sum_{i=k+1}^{n} y_{(i)}^2 + \lambda_n^2 k$

over $k$, where $|y|_{(1)} \ge \cdots \ge |y|_{(n)}$. Such a procedure implies a "keep-or-kill"
hard thresholding with a (fixed) threshold $\lambda_n$,

$\hat\mu_i^* = y_i I\{|y_i| \ge \lambda_n\}, \quad i = 1,\ldots,n.$

The widely-known universal threshold of Donoho and Johnstone [9] is
$\lambda_U = \sigma\sqrt{2\ln n}$ and, as $n \to \infty$, the resulting estimator comes within a
constant factor of asymptotic minimaxity for $\ell_r$ losses simultaneously
throughout a range of various sparsity classes (Donoho and Johnstone [10, 11]).
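
The fixed-threshold rule can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the noise level `sigma` is assumed known, as in (1.1).

```python
import math

def universal_threshold(n, sigma):
    """Universal threshold sigma * sqrt(2 ln n) for n observations."""
    return sigma * math.sqrt(2.0 * math.log(n))

def hard_threshold(y, lam):
    """Keep-or-kill rule: keep y_i when |y_i| >= lam, otherwise set it to zero."""
    return [yi if abs(yi) >= lam else 0.0 for yi in y]
```

For example, `hard_threshold(y, universal_threshold(len(y), sigma))` applies the universal rule to a data vector `y`.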

A complexity penalization of type (1.3) is closely connected to model
selection. For example, Akaike's [5] AIC model selection rule takes $\lambda_n =
\sqrt{2}\sigma$, the Schwarz [24] BIC criterion corresponds to $\lambda_n = \sigma\sqrt{\ln n}$, while the
RIC criterion of Foster and George [14] adjusted for (1.1) implies $\lambda_n =
\sigma\sqrt{2\ln n}$.

A natural extension of (1.3) is to consider a variable penalization sequence
$\lambda_{i,n}^2$, that is,

(1.4) $\|y-\mu\|_2^2 + \sum_{i=0}^{\|\mu\|_0} \lambda_{i,n}^2.$

Let $\hat k$ be a minimizer of

(1.5) $\sum_{j=k+1}^{n} y_{(j)}^2 + \sum_{i=0}^{k} \lambda_{i,n}^2$

over $k$. The resulting minimizer $\hat\mu^*$ of (1.4) is obviously a hard thresholding
rule with a variable threshold $\lambda_{\hat k,n}$: $\hat\mu_i^* = y_i I\{|y_i| \ge \lambda_{\hat k,n}\}$. If $\hat k = 0$, all $y_i$ are
thresholded and $\hat\mu^* \equiv 0$.

Several variable penalty estimators of the type (1.4) have been proposed
in the literature. The FDR-thresholding rule of Abramovich and Benjamini
[2] corresponds to $\lambda_{i,n} = \sigma z(1 - (i/n)(q/2)) \sim \sigma\sqrt{2\ln((n/i)(2/q))}$, where $z(\cdot)$
are standard Gaussian quantiles and $q$ is the tuning parameter of the FDR
procedure. Abramovich et al. [4] showed that, for $q \le 1/2$, the FDR
estimator achieves sharp (with a right constant) asymptotic minimaxity,
simultaneously over an entire range of nearly black sets and $\ell_p$-balls with respect
to $\ell_r$ losses. Foster and Stine [15] suggested $\lambda_{i,n} = \sigma\sqrt{2\ln(n/i)}$ from
information theory considerations. The covariance inflation criterion for model
selection of Tibshirani and Knight [26] adjusted for (1.1) corresponds to
$\lambda_{i,n} = 2\sigma\sqrt{\ln(n/i)}$.

For a general $\ell_0$-type penalty $P_n(\|\mu\|_0)$, the corresponding penalized
estimator $\hat\mu^*$ is a hard thresholding rule with the data-dependent threshold
$|y|_{(\hat k)}$, where $\hat k$ is the minimizer of

(1.6) $\sum_{i=k+1}^{n} y_{(i)}^2 + P_n(k).$

Obviously, (1.5) can be viewed as a particular case of (1.6).
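
A sketch of how criterion (1.6) might be minimized in practice: sort the observations by magnitude and scan $k = 0,\ldots,n$. The function name and the choice to pass the penalty as a callable are assumptions made for illustration.

```python
def penalized_hard_threshold(y, penalty):
    """Minimize sum_{i>k} y_(i)^2 + penalty(k) over k = 0..n, as in criterion (1.6).

    Returns the selected k and the estimate that keeps the k largest |y_i|.
    """
    n = len(y)
    # Indices ordered by decreasing magnitude, so order[0] points at |y|_(1).
    order = sorted(range(n), key=lambda i: abs(y[i]), reverse=True)
    squares = [y[i] ** 2 for i in order]
    tail = sum(squares)  # residual sum of squares when nothing is kept (k = 0)
    best_k, best_value = 0, tail + penalty(0)
    for k in range(1, n + 1):
        tail -= squares[k - 1]  # keeping the k-th largest entry removes it from the residual
        value = tail + penalty(k)
        if value < best_value:
            best_k, best_value = k, value
    estimate = [0.0] * n
    for i in order[:best_k]:
        estimate[i] = y[i]
    return best_k, estimate
```

With `penalty = lambda k: lam ** 2 * k` this reduces to the fixed-threshold criterion (1.3).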

A series of recent papers has considered the $2k\ln(n/k)$-type penalties of
the form

(1.7) $P_n(k) = 2\sigma^2\zeta k(\ln(n/k) + c_{k,n}),$

where $\zeta > 1$ and $c_{k,n}$ is a negligible term relative to $\ln(n/k)$ for $k \ll n$ (sparse
cases) (e.g., Birgé and Massart [8], Johnstone [21] and Abramovich et al. [4]).

A wide class of $\ell_0$-type penalties satisfying certain technical conditions
was considered in Birgé and Massart [8]. However, most of their results on
optimality have been obtained for a particular $2k\ln(n/k)$-type penalty, while
it remains somewhat unclear how to construct "meaningful" penalties from
their class in general.

The objective of this paper is to develop a framework for $\ell_0$-penalization
which is general and meaningful at the same time. The Bayesian approach
provides a natural interpretation of the penalized likelihood estimators by
relating the penalty models to the corresponding prior distribution on $\mu$. For
the model (1.1), the penalty term in (1.2) is then proportional to the
logarithm of the prior. From a Bayesian view, minimization of (1.2) corresponds
to the maximum a posteriori (MAP) rule and the resulting penalized
estimator (called MAP thereafter) is the posterior mode. The $\ell_p$-type penalties
for $p > 0$ correspond to placing priors on the magnitudes of $\mu_i$, while the
$\ell_0$-type penalties necessarily involve a prior on the number of nonzero $\mu_i$.

In this paper we develop a Bayesian formalism which gives rise to a family
of $\ell_0$-type penalties in (1.6). This family is associated with various choices
of the prior distributions on the number of nonzero entries of the unknown
vector and, hence, is easy to interpret. Moreover, under mild conditions, the
penalties considered in this paper fall within the class considered in Birgé
and Massart [8], which allows us to establish optimality of the corresponding
Bayesian estimators in a rather general setting. We then demonstrate that
in the case when the vector $\mu$ is sparse, the MAP estimators achieve optimal
convergence rates. We also specify the class of prior distributions for which
the resulting estimators are adaptive for a wide range of sparse sequences
and provide examples of such priors.

The paper is organized as follows. In Section 2 we introduce the Bayesian 
MAP "testimation" procedure leading to penalized estimators (1.6). In Sec- 
tion 3 we derive upper bounds for their quadratic risk and compare it with 
that of an ideal oracle estimator (oracle inequality). Asymptotic optimality 
of the proposed MAP "testimators" in various sparse settings is established 
in Section 4. Several specific priors $\pi_n(k)$ are considered in Section 5 as
examples. In Section 6 we present a short simulation study to demonstrate 
the performance of MAP estimators and compare them with several existing 
counterparts. Some concluding remarks are given in Section 7. All the proofs 
are placed in the Appendix. 

2. MAP testimating procedure. 

2.1. Thresholding as testimation. Abramovich and Benjamini [2]
demonstrated that thresholding can be viewed as a multiple hypothesis testing
procedure, where, given the data $y = (y_1,\ldots,y_n)'$ in (1.1), one first
simultaneously tests $\mu_i$, $i = 1,\ldots,n$, for significance. Those $\mu_i$'s which are concluded
to be significant are estimated by the corresponding $y_i$, while nonsignificant
$\mu_i$'s are discarded. Such a testimation procedure obviously mimics a hard
thresholding rule.

In particular, the likelihood ratio test rejects the null hypothesis $H_{0i}: \mu_i =
0$ if and only if $|y_i| > \lambda_n$, where controlling the familywise error at level $\alpha$
by the Bonferroni approach leads to $\lambda_n = \sigma z(1 - \alpha/(2n)) \sim \sigma\sqrt{2\ln n} = \lambda_U$
for any reasonable $\alpha$. In other words, universal thresholding may be viewed
as a Bonferroni multiple testing procedure with familywise error level of
approximately $1/\sqrt{\ln n}$, which slowly approaches zero as $n$ increases. Such a
severe error control explains why universal thresholding is so conservative.
A less stringent alternative to a familywise error control is the false discovery
rate (FDR) criterion of Benjamini and Hochberg [7]. The corresponding
FDR thresholding was considered in Abramovich and Benjamini [2, 3] in the
context of wavelet series estimation and comprehensively developed further
in Abramovich et al. [4] for a general normal means problem setting.
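
To see how close the Bonferroni cutoff is to the universal threshold for a given $n$, one can compare the two numerically. The sketch below uses the standard-library `statistics.NormalDist` for the Gaussian quantile $z(\cdot)$. The function names are illustrative.

```python
import math
from statistics import NormalDist

def bonferroni_threshold(n, sigma, alpha):
    """sigma * z(1 - alpha/(2n)): two-sided Bonferroni cutoff at familywise level alpha."""
    return sigma * NormalDist().inv_cdf(1.0 - alpha / (2.0 * n))

def ratio_to_universal(n, alpha):
    """Bonferroni cutoff divided by sqrt(2 ln n); sigma cancels in the ratio."""
    return bonferroni_threshold(n, 1.0, alpha) / math.sqrt(2.0 * math.log(n))
```

For moderate $\alpha$ the ratio stays near one and drifts only slowly with $n$, which is the sense of the asymptotic equivalence stated above.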

In this paper we shall follow a more general testimation approach to 
thresholding based on the multiple hypothesis testing procedure introduced 
by Abramovich and Angelini [1], which efficiently utilizes the Bayesian
framework. We shall review this approach in the following section.

2.2. MAP multiple testing procedure. For the model (1.1), consider the
multiple hypothesis testing problem, where we wish to simultaneously test

$H_{0i}: \mu_i = 0$ against $H_{1i}: \mu_i \ne 0, \quad i = 1,\ldots,n.$

A configuration of true and false null hypotheses is uniquely defined by
the indicator vector $x$, where $x_i = I\{\mu_i \ne 0\}$, $i = 1,\ldots,n$. Let $k = x_1 + \cdots +
x_n = \|\mu\|_0$ be the number of significant $\mu_i$ (false nulls). Assume some prior
distribution $k \sim \pi_n(k) > 0$, $k = 0,\ldots,n$. For a given $k$ there are $\binom{n}{k}$ various
configurations of true and false null hypotheses. Assume all of them to be
equally likely a priori, that is, conditionally on $k$,

$P\left(x \,\Big|\, \sum_{i=1}^{n} x_i = k\right) = \binom{n}{k}^{-1}.$

Naturally, $\mu_i \mid x_i = 0 \sim \delta_0$, where $\delta_0$ is a probability atom at zero. To complete
the prior, assume $\mu_i \mid x_i = 1 \sim N(0, \tau^2)$.

For the proposed hierarchical prior, the posterior distribution of
configurations is given by

(2.8) $\pi_n(x, k \mid y) \propto \binom{n}{k}^{-1} \pi_n(k)\, I\left\{\sum_{i=1}^{n} x_i = k\right\} \prod_{i=1}^{n} (B_i^{-1})^{x_i},$

where the Bayes factor $B_i$ of $H_{0i}$ is

(2.9) $B_i = \sqrt{1+\gamma}\, \exp\left\{-\frac{y_i^2}{2\sigma^2(1+1/\gamma)}\right\}$

and the variances ratio $\gamma = \tau^2/\sigma^2$ (Abramovich and Angelini [1]).

Given the posterior distribution $\pi_n(x, k \mid y)$, we apply a maximum a
posteriori (MAP) rule to choose the most likely configuration of true and false
nulls. Generally, to find the posterior mode of $\pi_n(x, k \mid y)$, one should look
through all $2^n$ possible configurations. However, for the proposed model, the
number of candidates for a mode is, in fact, reduced to $n + 1$ only. Indeed, let
$\hat x(k)$ be a maximizer of (2.8) for a fixed $k$ that indicates the most plausible
configuration with $k$ false null hypotheses. From (2.8) it follows immediately
that $\hat x_i(k) = 1$ for the $k$ tests with the smallest Bayes factors $B_i$ and is zero
otherwise. Due to the monotonicity of $B_i$ in $|y_i|$ [see (2.9)], this is
equivalent to $\hat x_i(k) = 1$ corresponding to the $k$ largest $|y_i|$ and zero for the others.
The Bayesian MAP multiple testing procedure then leads to finding $\hat k$ that
maximizes

$2\sigma^2(1+1/\gamma)\ln \pi_n(\hat x(k), k \mid y) = \sum_{i=1}^{k} y_{(i)}^2 + 2\sigma^2(1+1/\gamma)\ln\left\{\binom{n}{k}^{-1}\pi_n(k)(1+\gamma)^{-k/2}\right\} + \mathrm{const}$

or, equivalently, minimizes

$\sum_{i=k+1}^{n} y_{(i)}^2 + 2\sigma^2(1+1/\gamma)\ln\left\{\binom{n}{k}\pi_n^{-1}(k)(1+\gamma)^{k/2}\right\}.$

The $\hat k$ null hypotheses corresponding to $|y|_{(1)},\ldots,|y|_{(\hat k)}$ are rejected. The
resulting Bayesian testimation yields hard thresholding with a threshold
$\hat\lambda_{MAP} = |y|_{(\hat k)}$:

$\hat\mu_i^* = \begin{cases} y_i, & |y_i| \ge \hat\lambda_{MAP}, \\ 0, & \text{otherwise}, \end{cases}$

and is, in fact, the posterior mode of the joint distribution $(\mu, x, k \mid y)$.

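From a frequentist view, the above MAP estimator $\hat\mu^*$ is a penalized
likelihood estimator (1.6) with the complexity penalty

(2.10) $P_n(k) = 2\sigma^2(1+1/\gamma)\ln\left\{\binom{n}{k}\pi_n^{-1}(k)(1+\gamma)^{k/2}\right\}.$

Rewriting $\binom{n}{k} = \prod_{i=1}^{k}(n-i+1)/i$ and $\pi_n(k) = \pi_n(0)\prod_{i=1}^{k}\pi_n(i)/\pi_n(i-1)$,
(2.10) yields

(2.11) $P_n(k) = 2\sigma^2(1+1/\gamma)\left(\ln\pi_n^{-1}(0) + \sum_{i=1}^{k}\ln\left\{\frac{n-i+1}{i}\,\frac{\pi_n(i-1)}{\pi_n(i)}\,\sqrt{1+\gamma}\right\}\right) = \sum_{i=0}^{k}\lambda_{i,n},$

where $\lambda_{0,n} = 2\sigma^2(1+1/\gamma)\ln\pi_n^{-1}(0)$ and $\lambda_{i,n} = 2\sigma^2(1+1/\gamma)\ln\left(\frac{n-i+1}{i}\,\frac{\pi_n(i-1)}{\pi_n(i)}\,\sqrt{1+\gamma}\right)$,
$i = 1,\ldots,n$.

In such a form the penalty (2.11) is similar to (1.5) with the notation $\lambda_{i,n}$
instead of $\lambda_{i,n}^2$ since some of them might be negative in a general case.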
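
The penalty (2.10) and its increments in (2.11) are straightforward to evaluate on the log scale. The following is a minimal sketch, assuming the prior is supplied as a function `log_prior(k)` returning $\ln\pi_n(k)$. That interface and the function names are illustrative choices, not part of the paper.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def map_penalty(k, n, sigma, gamma, log_prior):
    """Penalty (2.10): 2 sigma^2 (1 + 1/gamma) ln{ C(n,k) / pi_n(k) * (1+gamma)^(k/2) }."""
    scale = 2.0 * sigma ** 2 * (1.0 + 1.0 / gamma)
    return scale * (log_binom(n, k) - log_prior(k) + 0.5 * k * math.log1p(gamma))

def penalty_increment(i, n, sigma, gamma, log_prior):
    """Increment lambda_{i,n} from (2.11); it may be negative for some priors."""
    scale = 2.0 * sigma ** 2 * (1.0 + 1.0 / gamma)
    if i == 0:
        return -scale * log_prior(0)
    return scale * (math.log((n - i + 1) / i)
                    + log_prior(i - 1) - log_prior(i)
                    + 0.5 * math.log1p(gamma))
```

Summing `penalty_increment(i, ...)` over `i = 0, ..., k` should reproduce `map_penalty(k, ...)` up to rounding, which is a convenient sanity check on any prior one plugs in.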

A specific form of the resulting Bayesian hard thresholding rule depends
on the choice of a prior $\pi_n(k)$. In particular, the binomial prior $B(n, \xi_n)$
yields a fixed threshold $\lambda^2 = 2\sigma^2(1+1/\gamma)\ln\left(\frac{1-\xi_n}{\xi_n}\sqrt{1+\gamma}\right) \sim 2\sigma^2\ln\left(\frac{1-\xi_n}{\xi_n}\sqrt{\gamma}\right)$ for
sufficiently large $\gamma$. The AIC criterion corresponds to $\xi_n \sim \sqrt{\gamma}/(e + \sqrt{\gamma})$,
while $\xi_n = 1/n$ leads to the universal thresholding of Donoho and
Johnstone [9]. Abramovich and Angelini [1] showed that the "reflected"
truncated Poisson distribution $\pi_n(k) \propto (n-\lambda_n)^{n-k}/(n-k)!$, with $\lambda_n = o(n)$
satisfying $\lambda_n/\sqrt{n\ln n} \to \infty$, approximates the FDR thresholding procedures
of Benjamini and Hochberg [7] and Sarkar [23] with the FDR parameter

q n ~ (y^ln^n/An))" 1 . 

Remark 1. There is an intriguing parallel between the penalty $P_n(k)$
in (2.10) and $2k\ln(n/k)$-type penalties (1.7) introduced above. For $k \ll n$,
$\ln\binom{n}{k} \sim k\ln(n/k)$ and the penalty (2.10) is of $2k\ln(n/k)$-type, where $\zeta =
(1+1/\gamma)$ and $c_{k,n} = (1/k)\ln\pi_n^{-1}(k) + (1/2)\ln(1+\gamma)$ is defined by the choice
of the prior $\pi_n(k)$. The $2k\ln(n/k)$-type penalty can be viewed, therefore, as a
particular case of the more general penalty (2.10) for $\pi_n(k)$ satisfying $c_{k,n} =
O(\ln(n/k))$, or, equivalently, $\ln\pi_n(k) = O(k\ln(k/n))$ for $k \ll n$.
Throughout the paper we discuss the relations between $P_n(k)$ and $2k\ln(n/k)$-type
penalties in more detail.

In what follows, we study optimality of the proposed thresholding MAP 
estimators. 

3. Oracle inequality. In this section we derive an upper bound for the
quadratic risk $\rho(\hat\mu^*, \mu) = E\|\hat\mu^* - \mu\|^2$ of the MAP thresholding estimator and
compare it with the ideal risk of an oracle estimator.

Assumption (A). Assume that

(3.12) $\pi_n(k) \le \binom{n}{k}e^{-c(\gamma)k}, \quad k = 0,\ldots,n,$

where $c(\gamma) = 8(\gamma + 3/4)^2$.

Theorem 1 (Upper bound). Under Assumption (A),

(3.13) $\rho(\hat\mu^*, \mu) \le c_0(\gamma)\inf_{0\le k\le n}\left\{\sum_{i=k+1}^{n}\mu_{(i)}^2 + 2\sigma^2(1+1/\gamma)\ln\left(\binom{n}{k}\pi_n^{-1}(k)(1+\gamma)^{k/2}\right)\right\} + c_1(\gamma)\sigma^2$

for some $c_0(\gamma)$ and $c_1(\gamma)$ depending only on $\gamma$.

The obvious inequality $\binom{n}{k} \ge (n/k)^k$ implies that, for any $\pi_n(k)$, (3.12)
holds for all $k \le ne^{-c(\gamma)}$. Applying the arguments very similar to those in the
proof of Theorem 1, one gets then a somewhat weaker general upper bound
for the quadratic risk of $\hat\mu^*$ without the requirement of Assumption (A):



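Corollary 1. For any prior $\pi_n(\cdot)$,

$\rho(\hat\mu^*, \mu) \le c_0(\gamma)\inf_{0\le k\le ne^{-c(\gamma)}}\left\{\sum_{i=k+1}^{n}\mu_{(i)}^2 + 2\sigma^2(1+1/\gamma)\ln\left(\binom{n}{k}\pi_n^{-1}(k)(1+\gamma)^{k/2}\right)\right\} + c_1(\gamma)\sigma^2,$

where $c(\gamma) = 8(\gamma + 3/4)^2$, and $c_0(\gamma)$ and $c_1(\gamma)$ depend only on $\gamma$.

Note that the upper bounds in Theorem 1 and Corollary 1 are
nonasymptotic and hold for any $\mu \in \mathbb{R}^n$.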

In order to assess the quality of the upper bound in (3.13), we compare
the quadratic risk of the MAP estimator with that of the ideal estimator
$\hat\mu_{oracle}$ which one could obtain if one had available an oracle which reveals
the true vector $\mu$. This ideal risk is known to be

(3.14) $\rho(\hat\mu_{oracle}, \mu) = \sum_{i=1}^{n}\min(\mu_i^2, \sigma^2)$

(Donoho and Johnstone [9]). The ideal estimator $\hat\mu_{oracle}$ is obviously
unavailable but can be used as a benchmark for the risk of other estimators. Note
that the risk of $\hat\mu_{oracle}$ is zero when $\mu \equiv 0$ and, evidently, no estimator can
achieve this risk bound in this case. An additional (usually negligible) term
$\sigma^2$, which is, in fact, an error of unbiased estimation of one extra parameter,
is usually added to the ideal risk in (3.14) for a proper comparison (see, e.g.,
Donoho and Johnstone [9]).
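
For simulation studies where the true mean vector is known, the benchmark in (3.14) is a one-line computation. The sketch below is illustrative, and the function names are not from the paper.

```python
def oracle_risk(mu, sigma):
    """Ideal risk (3.14): sum over coordinates of min(mu_i^2, sigma^2)."""
    return sum(min(m * m, sigma * sigma) for m in mu)

def oracle_benchmark(mu, sigma):
    """Ideal risk plus the extra sigma^2 term that is added for a proper comparison."""
    return oracle_risk(mu, sigma) + sigma * sigma
```

The second function is strictly positive even for a zero mean vector, which is why it is the more sensible denominator when reporting risk ratios.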
Define

(3.15) $L_{0,n} = 2\ln\pi_n^{-1}(0), \qquad L_{k,n} = (1/k)\ln\left(\binom{n}{k}\pi_n^{-1}(k)\right), \quad k = 1,\ldots,n,$

and let $L_n^* = \max_{0\le k\le n} L_{k,n}$. The following theorem states that the MAP
thresholding estimator performs within a factor of $c_2(\gamma)(2L_n^* + \ln(1+\gamma))$
with respect to the oracle.

Theorem 2 (Oracle inequality). Consider the MAP thresholding
estimator $\hat\mu^*$ and the corresponding penalty $P_n(k)$ defined in (2.10). Under
Assumption (A),

$\rho(\hat\mu^*, \mu) \le c_2(\gamma)(2L_n^* + \ln(1+\gamma))(\rho(\hat\mu_{oracle}, \mu) + \sigma^2)$

for some $c_2(\gamma)$ depending only on $\gamma$.

To understand how tight the factor $c_2(\gamma)(2L_n^* + \ln(1+\gamma))$ is, recall that
when $n$ is large, there is a sharp upper bound for the quadratic risk,

(3.16) $\inf_{\tilde\mu}\sup_{\mu}\frac{E\|\tilde\mu - \mu\|^2}{\rho(\hat\mu_{oracle}, \mu) + \sigma^2} = 2\log n\,(1 + o(1))$ as $n \to \infty$

(Donoho and Johnstone [9]).

Therefore, no available estimator has a risk smaller than within the factor
$2\log n$ from an oracle. Obvious calculus shows that if $\pi_n(k) \ge n^{-ck}$, $k =
1,\ldots,n$, and $\pi_n(0) \ge n^{-c}$ for some constant $c > 0$, then $L_n^* = O(\log n)$ and
the MAP estimator achieves the minimal possible risk among all available
estimators (3.16) up to a constant factor depending on $\gamma$:

Corollary 2. Let $\pi_n(k)$ satisfy Assumption (A) and, in addition,
$\pi_n(k) \ge n^{-ck}$, $k = 1,\ldots,n$, and $\pi_n(0) \ge n^{-c}$ for some constant $c > 0$. The
resulting MAP estimator $\hat\mu^*$ satisfies

(3.17) $\sup_{\mu}\frac{\rho(\hat\mu^*, \mu)}{\rho(\hat\mu_{oracle}, \mu) + \sigma^2} = c_3(\gamma)\,2\log n\,(1 + o(1))$ as $n \to \infty$

for some $c_3(\gamma) \ge 1$.

In particular, Corollary 2 holds for $\ln\pi_n(k) = O(k\ln(k/n))$ corresponding
to the $2k\ln(n/k)$-type penalties (see Remark 1 in Section 2.2). However, the
condition $\pi_n(k) \ge n^{-ck}$ required in the corollary is much weaker and covers
a far wider class of possible priors.

4. Minimaxity and adaptivity in sparse settings. The results of Section
3 hold for any $\mu \in \mathbb{R}^n$. In this section we show that they can be improved if
an extra sparsity constraint on $\mu$ is added. We start by introducing several
possible ways to quantify sparsity and then derive conditions on the prior
$\pi_n(\cdot)$ which imply asymptotic minimaxity of the resulting MAP estimator
$\hat\mu^*$ over various sparse settings.

4.1. Sparsity. The most intuitive measure of sparsity is the number of
nonzero components of $\mu$, or its $\ell_0$ quasi-norm: $\|\mu\|_0 = \#\{i : \mu_i \ne 0, i = 1,\ldots,n\}$.
Define an $\ell_0$-ball $l_0[\eta]$ of standardized radius $\eta$ as a set of $\mu$ with at most a
proportion $\eta$ of nonzero entries, that is,

$l_0[\eta] = \{\mu \in \mathbb{R}^n : \|\mu\|_0 \le \eta n\}.$

In a wider sense sparsity can be defined by the proportion of large entries.
Formally, define a weak $\ell_p$-ball $m_p[\eta]$ with standardized radius $\eta$ as

$m_p[\eta] = \{\mu \in \mathbb{R}^n : |\mu|_{(i)} \le \eta(n/i)^{1/p},\ i = 1,\ldots,n\}.$

Sparsity can be also measured in terms of the $\ell_p$-norm of a vector. A strong
$\ell_p$-ball $l_p[\eta]$ with standardized radius $\eta$ is defined as

$l_p[\eta] = \left\{\mu \in \mathbb{R}^n : \frac{1}{n}\sum_{i=1}^{n}|\mu_i|^p \le \eta^p\right\}.$

There are important relationships between these balls. The $\ell_p$-norm
approaches $\ell_0$ as $p$ decreases. The weak $\ell_p$-ball contains the corresponding
strong $\ell_p$-ball, but only just:

$l_p[\eta] \subset m_p[\eta] \not\subset l_{p'}[\eta], \quad p' > p.$

The smaller $p$ is, the sparser is $\mu$. Sparse settings correspond to $p < 2$
(e.g., Johnstone [21]).
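
The three sparsity classes translate directly into membership checks. The sketch below follows the definitions of $l_0[\eta]$, $m_p[\eta]$ and $l_p[\eta]$ given above. The function names are illustrative.

```python
def in_l0_ball(mu, eta):
    """l_0[eta]: at most a proportion eta of the entries are nonzero."""
    return sum(1 for m in mu if m != 0) <= eta * len(mu)

def in_weak_lp_ball(mu, eta, p):
    """m_p[eta]: the i-th largest magnitude is at most eta * (n/i)^(1/p) for every i."""
    n = len(mu)
    magnitudes = sorted((abs(m) for m in mu), reverse=True)
    return all(magnitudes[i - 1] <= eta * (n / i) ** (1.0 / p) for i in range(1, n + 1))

def in_strong_lp_ball(mu, eta, p):
    """l_p[eta]: the average of |mu_i|^p is at most eta^p."""
    return sum(abs(m) ** p for m in mu) / len(mu) <= eta ** p
```

A vector lying exactly on the weak-ball boundary, with $|\mu|_{(i)} = \eta(n/i)^{1/p}$ for all $i$, passes the weak check but generally fails the strong one, which illustrates that the inclusion of the strong ball in the weak ball is strict.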

4.2. Minimaxity in sparse settings. We now exploit Theorem 1 to prove
asymptotic minimaxity of the proposed MAP estimator over various sparse
balls defined above. For this purpose, we define minimax quadratic risk over
a given set $\Theta_n$ in (1.1) as

$R(\Theta_n) = \inf_{\tilde\mu}\sup_{\mu\in\Theta_n} E\|\tilde\mu - \mu\|^2$

and examine various sparse sets $\Theta_n$, namely, $\ell_0$, strong and weak $\ell_p$-balls,
where sparsity assumes that the standardized radius $\eta$ tends to zero as $n$
increases.

The general idea for establishing asymptotic minimaxity is common for
all cases: for each particular setting, we find the "least favorable" sequence
$\mu_0 = \mu_0(p, \eta)$ and the "equilibrium point" $k_n^* = k_n^*(p, \eta)$ that keeps balance
between $\sum_{i=k_n^*+1}^{n}\mu_{0(i)}^2$ and the penalty term $P_n(k_n^*)$ on the RHS of (3.13). We
show that a requirement $\pi_n(k_n^*) \ge (k_n^*/n)^{c_p k_n^*}$ for some $c_p > 0$ on the prior
$\pi_n(\cdot)$ at a single point $k_n^*$ is sufficient for optimality of the MAP estimator
$\hat\mu^*$. Sections 4.2.1-4.2.3 show that the MAP thresholding estimator achieves
asymptotic optimality up to a constant factor in a variety of sparse settings
listed above.

4.2.1. Optimality over $\ell_0$-balls. Consider an $\ell_0$-ball $l_0[\eta]$, where $\eta \to 0$ as
$n \to \infty$. Then, by Donoho et al. [12],

$R(l_0[\eta]) \sim \sigma^2 n\eta(2\ln\eta^{-1}),$

where the relation "$\sim$" means that the ratio of the two sides tends to one
as $n$ increases.

Theorem 3. Define $k_n^* = n\eta$. Let $\eta \to 0$ as $n \to \infty$ (sparsity
assumption) but $n\eta \not\to 0$. If there exists a constant $c_0 > 0$ such that $\pi_n(k_n^*) \ge
(k_n^*/n)^{c_0 k_n^*}$, then the MAP estimator $\hat\mu^*$ achieves optimality up to a
constant factor, that is,

$\sup_{\mu\in l_0[\eta]} E\|\hat\mu^* - \mu\|^2 = O(n\eta(2\ln\eta^{-1})).$

4.2.2. Optimality over weak $\ell_p$-balls. Consider a weak $\ell_p$-ball $m_p[\eta]$, $0 <
p < 2$, and let $\eta \to 0$ as $n \to \infty$. In what follows we distinguish between sparse
cases, where $n^{1/p}\eta \ge \sqrt{2\ln n}$, and super-sparse cases, where $n^{1/p}\eta < \sqrt{2\ln n}$.
From the results of Donoho and Johnstone (e.g., Johnstone [20, 21], Donoho
and Johnstone [10, 11]) it is known that

$R(m_p[\eta]) \sim \begin{cases} \frac{2}{2-p}\,\sigma^2 n\eta^p(2\ln\eta^{-p})^{1-p/2}, & n^{1/p}\eta \ge \sqrt{2\ln n}, \\ \frac{2}{2-p}\,\sigma^2 n^{2/p}\eta^2, & n^{1/p}\eta < \sqrt{2\ln n}. \end{cases}$

Theorem 4. Let $\eta \to 0$ as $n \to \infty$.

1. Let $n^{1/p}\eta \ge \sqrt{2\ln n}$ (sparse case). Define $k_n^* = n\eta^p(\ln\eta^{-p})^{-p/2}$. If there
exists a constant $c_p > 0$ such that $\pi_n(k_n^*) \ge (k_n^*/n)^{c_p k_n^*}$, then

$\sup_{\mu\in m_p[\eta]} E\|\hat\mu^* - \mu\|^2 = O(n\eta^p(2\ln\eta^{-p})^{1-p/2}).$

2. Let $n^{1/p}\eta < \sqrt{2\ln n}$ (super-sparse case) but $n^{1/p}\eta \not\to 0$. If there exists a
constant $c_p > 0$ such that $\pi_n(0) \ge \exp(-c_p\eta^2 n^{2/p})$, then

$\sup_{\mu\in m_p[\eta]} E\|\hat\mu^* - \mu\|^2 = O(n^{2/p}\eta^2).$

4.2.3. Optimality over strong $\ell_p$-balls. The minimax risk over a strong
$\ell_p$-ball, $0 < p < 2$, is the same as over the corresponding weak $\ell_p$-ball $m_p[\eta]$ but
without the constant factor $2/(2-p)$ (Johnstone [20], Donoho and Johnstone
[11]), that is,

$R(l_p[\eta]) \sim \begin{cases} \sigma^2 n\eta^p(2\ln\eta^{-p})^{1-p/2}, & n^{1/p}\eta \ge \sqrt{2\ln n}, \\ \sigma^2 n^{2/p}\eta^2, & n^{1/p}\eta < \sqrt{2\ln n}. \end{cases}$

Theorem 5. Let $\eta \to 0$ as $n \to \infty$.





4.3. Adaptivity. In Sections 4.2.1-4.2.3 for sparse cases we established
optimality of the MAP estimator over a given ball if the condition $\pi_n(k) \ge
(k/n)^{ck}$ holds at the single "equilibrium point" $k_n^*$ depending on parameters
$p$ and $\eta$ of a ball. From the results of Theorems 3-5 it follows immediately
that if this condition holds for all $k = 1,\ldots,\kappa_n$, with some $\kappa_n < n$, the
corresponding MAP estimator $\hat\mu^*$ is adaptive in the sense that it achieves
optimal convergence rates simultaneously over an entire range of balls:

Theorem 6 (Adaptivity). Let $\Theta_n[\eta]$ be any of $l_0[\eta]$, $l_p[\eta]$ or $m_p[\eta]$,
where $\eta \to 0$ as $n \to \infty$. If there exists $\kappa_n = o(n)$ such that $(\ln n)/\kappa_n \to 0$ as
$n \to \infty$ and $\pi_n(k) \ge (k/n)^{ck}$ for all $k = 1,\ldots,\kappa_n$, and some constant $c > 0$,
then, for sufficiently large $n$,

(4.18) $\sup_{\mu\in\Theta_n[\eta]} E\|\hat\mu^* - \mu\|^2 = O(R(\Theta_n[\eta]))$

for all $0 < p < 2$ and $\eta^p \in [n^{-1}(2\ln n)^{p/2};\ n^{-1}\kappa_n]$.

For $\ell_0$-balls, it is sufficient to require $\kappa_n \not\to 0$ as $n \to \infty$ [instead of
$(\ln n)/\kappa_n \to 0$] in order that (4.18) hold for all $\eta \le \kappa_n n^{-1}$ such that $n\eta \not\to 0$.

The sufficient requirement $\pi_n(k) \ge (k/n)^{ck}$ for adaptivity established in
Theorem 6 corresponds to $2k\ln(n/k)$-type penalties (see Remark 1). At the
same time, the proofs of Theorems 3-5 indicate that this condition is, in fact,
also "almost necessary." Thus, essentially only $2k\ln(n/k)$-type complexity
penalties lead to adaptive estimation.

It is natural to find priors for which the optimality range for $\eta$ in Theorem
6 is the widest. From Theorem 6 it is clear that such priors should be of the
form $\pi_n(k) \propto (k/n)^{ck}$, $k = 1,\ldots,\kappa_n$, where $\kappa_n = o(n)$ should be as large as
possible. The function $(k/n)^k$ decreases for $k \le n/e$ and, hence, for all $c \ge 1$,
we have

The widest possible ranges for $\eta$ in Theorem 6 are, therefore, achieved for
priors of the form $\pi_n(k) = (k/n)^{ck}$, $k = 1,\ldots,\kappa_n$, where $c \ge 1$ and $\delta_n = \kappa_n/n$
tends to zero at an arbitrarily slow rate. The resulting ranges are $\eta \le \delta_n$ and
$\eta^p \in [n^{-1}(2\ln n)^{p/2};\ \delta_n]$ for $\ell_0$ and $\ell_p$-balls, respectively, and cover the entire
spectrum of sparse cases. From Lemma A.1 in the Appendix it follows that
$\ln\binom{n}{k} \sim k\ln(n/k)$ for all $k = o(n)$, and the corresponding complexity penalty
$P_n(k)$ in (2.10) is then

$P_n(k) = 2\sigma^2(1+1/\gamma)\ln\left\{\binom{n}{k}(n/k)^{ck}(1+\gamma)^{k/2}\right\} \sim 4\sigma^2\tilde c\,(1+1/\gamma)\,k\left(\ln(n/k) + \frac{1}{4\tilde c}\ln(1+\gamma)\right),$

where $\tilde c = (1/2)(c+1) \ge 1$. Such a penalty is obviously of the $2k\ln(n/k)$-type,
although, by analogy, more appropriately, it should be called of the
$4k\ln(n/k)$-type.

To complete this section note that, from a Bayesian viewpoint, it is also 
important to avoid a well-known Bayesian paradox where a prior (and, 
hence, a posterior) leading to an optimal estimator over a certain set has 
zero measure on this set. Hence, conditions on $\pi_n(k)$ should guarantee, in
addition, that, with high probability, a vector $\mu$ generated according to this
prior distribution falls within a given ball. We discuss this issue in more
detail in the examples in Section 5. 

5. Examples. In this section we consider three examples of $\pi_n(k)$ and
establish conditions on the parameters of these distributions imposed by
the general results of the previous sections.

5.1. Binomial distribution. Consider the binomial prior $B(n, \xi_n)$, where

$\pi_n(k) = \binom{n}{k}\xi_n^k(1-\xi_n)^{n-k}, \quad k = 0,\ldots,n.$

The binomial prior suggests independent $x_i$ with $P(x_i = 1) = \xi_n$, $i = 1,\ldots,n$.

Assumption (A) evidently holds for any $\xi_n \le e^{-c(\gamma)}$. We now find $\xi_n$ which
satisfies the conditions of Corollary 2 and, therefore, for which the resulting
MAP estimator achieves the minimal possible risk (up to a constant factor)
among all available estimators in the sense of (3.17).

For $k = 0$, $\pi_n(0) = (1-\xi_n)^n$ and in order to satisfy $\pi_n(0) \ge n^{-c}$ for
some $c > 0$, $\xi_n$ should necessarily tend to zero as $n$ increases.
Assumption (A) definitely holds in this case (see above). Furthermore, $(1-\xi_n)^n =
\exp\{-n\xi_n(1+o(1))\}$ and $\pi_n(0) \ge n^{-c}$ when $\xi_n \le c_1(\ln n)/n$ for any $c_1 < c$.

On the other hand, let $\xi_n \ge n^{-c_2}$ for some $c_2 \ge 1$. Then, for all $k \ge 1$, we
have

$\pi_n(k) \ge \xi_n^k(1-\xi_n)^{n-k} \ge \xi_n^k(1-\xi_n)^n \ge n^{-c_2 k}n^{-c} \ge n^{-\tilde c k},$

where $\tilde c = c + c_2$.

Summarizing, the validity of Corollary 2 for the binomial prior $B(n, \xi_n)$
is established for

(5.19) $n^{-c_2} \le \xi_n \le c_1\ln n/n,$

where $c_1 > 0$ and $c_2 \ge 1$.

The condition (5.19) holds, for example, for universal thresholding, where
$\xi_n \sim 1/n$, but not for the AIC criterion, where $\xi_n \sim \sqrt{\gamma}/(e + \sqrt{\gamma})$ (see
the discussion in Section 2.2). Abramovich and Angelini [1] showed that
for £ n < \J 7T7 In n jn the binomial prior leads to the Bonferroni multiple 
testing procedure with the familywise error rate (FWE) controlled level 

i^M^j^ir 1 < 1. 

As we have already mentioned in Section 2.2, the binomial prior yields a
fixed threshold $\lambda^2 = 2\sigma^2(1+1/\gamma)\ln\left(\frac{1-\xi_n}{\xi_n}\sqrt{1+\gamma}\right)$ and, hence, (5.19) implies

$\lambda^2 = 2\sigma^2(1+1/\gamma)(\ln\xi_n^{-1})(1+o(1)) \sim 2\sigma^2 c(\gamma)(\ln n).$
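
The fixed threshold implied by the binomial prior can be evaluated directly from the expression for $\lambda^2$ above. The sketch below is illustrative: the function name is hypothetical, and returning zero when the expression is nonpositive is a convention chosen here, since in that case the per-coordinate penalty does not favor discarding any observation.

```python
import math

def binomial_prior_threshold(xi, sigma, gamma):
    """Threshold lambda with lambda^2 = 2 sigma^2 (1 + 1/gamma) ln((1 - xi)/xi * sqrt(1 + gamma)).

    xi is the prior probability that a coordinate is nonzero, with 0 < xi < 1.
    """
    lam_sq = (2.0 * sigma ** 2 * (1.0 + 1.0 / gamma)
              * math.log((1.0 - xi) / xi * math.sqrt(1.0 + gamma)))
    # A nonpositive value means the penalty never favors discarding a coordinate.
    return math.sqrt(lam_sq) if lam_sq > 0 else 0.0
```

Smaller values of `xi` give larger thresholds, in line with the $\ln\xi_n^{-1}$ behavior of $\lambda^2$ noted above.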

In fact, the following proposition shows that $\xi_n$ from (5.19) also satisfies
the conditions of Theorem 6 and, therefore, yields an adaptive optimal MAP
estimator within the entire range of various types of sparse balls.

Proposition 1. Let $\xi_n$ satisfy (5.19). Then (4.18) holds for the
resulting MAP estimator for all weak and strong $\ell_p$-balls with $0 < p < 2$ and
$\eta^p \in [n^{-1}(2\ln n)^{p/2};\ \xi_n^{c_3}]$, and for $\ell_0$-balls with $\eta \in [c_4 n^{-1};\ \xi_n^{c_3}]$, where $0 < c_3 <
1/c_2 \le 1$ and $c_4 > 0$ can be arbitrarily small.

The widest possible ranges for $\eta$ covered by Proposition 1, namely, $\eta^p \in
[n^{-1}(2\ln n)^{p/2};\ (c_1\ln n/n)^{c_3}]$ for $0 < p < 2$ and $\eta \in [c_4 n^{-1};\ (c_1\ln n/n)^{c_3}]$ for
$p = 0$, respectively, are obtained for $\xi_n = c_1\ln n/n$. These optimality ranges
are still smaller than those for priors of the type $\pi_n(k) = (k/n)^{ck}$, $c \ge 1$,
discussed in Section 4.3.

On the other hand, to avoid the Bayesian paradox mentioned at the end
of Section 4.3, exploiting Lemma 7.1 in Abramovich et al. [4], we have for
$\ell_0$-balls,

$P(k > n\eta) \le \exp\left\{-(1/4)\,n\xi_n\min\left\{|\eta\xi_n^{-1} - 1|,\ |\eta\xi_n^{-1} - 1|^2\right\}\right\},$

and the above probability tends to zero as $n$ increases for $\eta \ge c_5(\ln n/n)$,
where $c_5 > 2c_1$.

For strong $\ell_p$-balls, define the standardized $z = \mu/\tau$ and apply Markov's
inequality to get

$P(\|\mu\|_p^p > n\eta^p) \le e^{-n\eta^p/\tau^p}\,Ee^{\|z\|_p^p}.$

For the hierarchical prior model introduced in Section 2.2, $Ee^{\|z\|_p^p} =
E(Ee^{\|z\|_p^p} \mid x) = E_{\pi_n}a_p^k$, where $a_p = Ee^{|\xi|^p}$ and $\xi$ is a standard normal $N(0,1)$.
It is easy to verify that $e < a_p < \infty$ for $0 < p < 2$. Thus,

(5.20) $P(\|\mu\|_p^p > n\eta^p) \le e^{-n\eta^p/\tau^p}\,E_{\pi_n}a_p^k.$

For the binomial prior,

$E_{\pi_n}a_p^k = (\xi_n a_p + 1 - \xi_n)^n = e^{n\xi_n(a_p-1)(1+o(1))}.$

If $\eta^p \ge \tau^p\xi_n(a_p - 1) + c_6(\ln\ln n/n)$, where $c_6 > 0$ is arbitrarily small, then
$P(\mu \in l_p[\eta]) \to 1$ and, therefore, $P(\mu \in m_p[\eta]) \to 1$ as well. We believe that
after extra effort it is possible to somewhat relax the conditions for weak
$\ell_p$-balls, but the resulting additional benefits are usually minor.

Combining these Bayesian admissibility results with Proposition 1, we
obtain the admissible optimality ranges for the binomial prior $B(n, \xi_n)$ with
$n^{-c_2} \le \xi_n \le c_1(\ln n)/n$, $c_1 > 0$ and $c_2 \ge 1$:

$\eta \in [c_5 n^{-1}\ln n;\ \xi_n^{c_3}], \quad p = 0,$

$\eta^p \in [\max\{n^{-1}(2\ln n)^{p/2},\ \tau^p\xi_n(a_p - 1) + c_6 n^{-1}\ln\ln n\};\ \xi_n^{c_3}], \quad 0 < p < 2,$

where $0 < c_3 < 1/c_2 \le 1$, $c_5 > 2c_1$ and $c_6 > 0$ can be arbitrarily small.

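5.2. Truncated Poisson distribution. Consider now a truncated Poisson
distribution, where

$\pi_n(k) = \frac{\lambda_n^k/k!}{\sum_{j=0}^{n}\lambda_n^j/j!}, \quad k = 0,\ldots,n,$

and $1 < \lambda_n < n$. Application of Stirling's formula and simple calculus yield
the following bounds on $\pi_n(k)$:

(5.21) $\frac{1}{\sqrt{2\pi k}}\left(\frac{\lambda_n}{k}\right)^k e^{k-\lambda_n-1/(12k)} \le \frac{\lambda_n^k}{k!}e^{-\lambda_n} \le \pi_n(k) \le \frac{\lambda_n^k/k!}{\lambda_n^{\lambda_n}/\lambda_n!} \le \left(\frac{\lambda_n}{k}\right)^{k+1/2} e^{k-\lambda_n+1/(12\lambda_n)}.$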

From (5.21) and Lemma A.l from the Appendix one has 
m7r re (A0-(m(")-M7)) 

(5.22) 

/A n e c ( 7 ' +1 \ 1 11 
< fclnl — I In k — X n H 1 — lnA n . 



n J 2 12A„ 2 

The function x - (l/12z) - (1/2) lnx > for all x > 1 and the RHS of (5.22) 
is negative for all k > 1 when A n < ne~( c ^ +1 \ Hence, Assumption (A) is 
satisfied for X n < ne~^ c ^ +l \ 
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We now check conditions for Corollary 2. For k = 0, one has 7r n (0) > e~ Xn 
and the requirement 7r n (0) > n~ Cl of Corollary 2 is satisfied for X n < cilnn. 
Note that this requirement immediately yields Assumption (A). 

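On the other hand, let, in addition, $\lambda_n \ge n^{-c_2}$ for some $c_2 > 0$. For $k = 1$,
(5.21) implies $\pi_n(1) \ge \lambda_n e^{-\lambda_n} \ge n^{-(c_1 + c_2)}$, while, for $k \ge 2$, note that
$k - 1/(12k) - (1/2)\ln(2\pi k) > 0$, and, therefore, one has from (5.21)

$$\ln \pi_n(k) \ge k\ln\left(\frac{\lambda_n}{k}\right) - \lambda_n \ge k\ln\left(\frac{n^{-c_2}}{k}\right) - c_1\ln n \ge k\ln\left(\frac{n^{-(c_1 + c_2)}}{k}\right) \ge k\ln n^{-(c_1 + c_2 + 1)}.$$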

Thus, for the truncated Poisson prior Corollary 2 holds if

(5.23) $$n^{-c_2} \le \lambda_n \le c_1 \ln n,$$

where $c_1 > 0$ and $c_2 > 0$.

In particular, Abramovich and Angelini [1] showed that for X n < \J irj In n 
the corresponding MAP testing procedure controls the FWE at the level 
ct n ~ A n ( In ( y/^jn / A n ) ) ~ 1 < 1 and is closely related to the FWE control- 
ling multiple testing procedures of Holm [18] and Hochberg [17]. 



Proposition 2. Let $\lambda_n$ satisfy (5.23). Then (4.18) holds for the
resulting MAP estimator for all weak and strong $l_p$-balls with $0 < p < 2$ and
$\eta^p \in [n^{-1}(2\ln n)^{p/2};\ (\lambda_n/n)^{c_3}]$, and for $l_0$-balls with $\eta \in [c_4 n^{-1};\ (\lambda_n/n)^{c_3}]$,
where $0 < c_3 < 1/(1 + c_2)$ and $c_4 > 0$ can be arbitrarily small.



Consider the corresponding Bayesian admissibility requirements for the
truncated Poisson prior. For $l_0$-balls, in the proof of their Lemma 1, Abramovich
and Angelini [1] showed that, with $\lambda_n = o(n)$ for any $\delta_n = o(n)$,

$$P(k > \lambda_n + \delta_n) \le C n u_n,$$

where $u_n = e^{\delta_n}/(1 + \delta_n/\lambda_n)^{\lambda_n + \delta_n + 1/2}$ and, therefore, $\ln u_n \le \delta_n(1 - \ln(\delta_n/\lambda_n))$.
In particular, set $\delta_n = \max(\ln n, e^{\zeta}\lambda_n)$, where $\zeta > 2$. Then

$$P(k > \lambda_n + \delta_n) \le C n e^{-\delta_n(\zeta - 1)} \le C n^{-(\zeta - 2)} \to 0$$

and, hence, $P(\mu \in l_0[\eta]) \to 1$ for $\eta \ge (\lambda_n + \delta_n)/n$. For $\lambda_n$ satisfying (5.23),
$P(\mu \in l_0[\eta]) \to 1$ holds for $\eta \ge c_5(\ln n/n)$, where $c_5 > 2\max(1, e^2 c_1)$.

For $l_p$-balls, exploiting (5.20) for the truncated Poisson prior and applying
Stirling's formula, one derives

k = SLooffiA 1 < Eg°=o°ffi/fc! ^^^^ 
and, therefore, for $\eta^p \ge \tau^p(\lambda_n/n)(a_p - 1) + c_6(\ln\ln n/n)$, where $c_6 > 0$ is
arbitrarily small, both $P(\mu \in l_p[\eta])$ and $P(\mu \in m_p[\eta])$ tend to one.

The resulting admissible optimality ranges for $\eta$ for the truncated Poisson
prior with $n^{-c_2} \le \lambda_n \le c_1 \ln n$, $c_1 > 0$ and $c_2 > 0$ are then given by

$$\eta \in [c_5(\ln n/n);\ (\lambda_n/n)^{c_3}], \qquad p = 0,$$

$$\eta^p \in [n^{-1}\max\{(2\ln n)^{p/2},\ \tau^p\lambda_n(a_p - 1) + c_6\ln\ln n\};\ (\lambda_n/n)^{c_3}], \qquad 0 < p < 2,$$

where $0 < c_3 < 1/(1 + c_2)$, $c_5 > 2\max(1, e^2 c_1)$, and $c_6 > 0$ can be arbitrarily
small.

The strong similarity between the results for truncated Poisson and binomial
priors with $\xi_n = \lambda_n/n$ is not surprising and is due to the well-known
asymptotic relations between Poisson and binomial distributions.
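
The closeness of the two priors can be illustrated numerically by comparing the binomial $B(n, \xi_n)$ with the truncated Poisson prior with $\lambda_n = n\xi_n$. The sketch below is only an illustration with assumed values $n = 1000$ and $\xi_n = 0.005$; the function names are hypothetical, and total variation distance is used here as one convenient measure of closeness, not as a quantity from the paper.

```python
import math

def truncated_poisson_pmf(lam, n):
    """pi_n(k) proportional to lam^k / k!, k = 0..n."""
    log_w = [k * math.log(lam) - math.lgamma(k + 1) for k in range(n + 1)]
    m = max(log_w)
    total = sum(math.exp(v - m) for v in log_w)
    return [math.exp(v - m) / total for v in log_w]

def binomial_pmf(n, xi):
    """Binomial B(n, xi) probabilities, computed on the log scale."""
    out = []
    for k in range(n + 1):
        log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + k * math.log(xi) + (n - k) * math.log(1 - xi))
        out.append(math.exp(log_p))
    return out

def total_variation(p, q):
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

n, xi = 1000, 0.005  # illustrative values (assumption)
tv = total_variation(binomial_pmf(n, xi), truncated_poisson_pmf(n * xi, n))
print(tv)
```

For small $\xi_n$ the distance is small, in line with the classical Poisson approximation to the binomial distribution.
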



5.3. Reflected truncated Poisson distribution. Finally, consider briefly a
"reflected" truncated Poisson distribution

(5.24) $$\pi_n(k) = \frac{(n - \lambda_n)^{n-k}/(n-k)!}{\sum_{j=0}^n (n - \lambda_n)^{n-j}/(n-j)!}, \qquad k = 0, \ldots, n,$$

and let $\lambda_n = o(n)$ but $\lambda_n/\sqrt{n\ln n} \to \infty$ as $n$ increases. The motivation for
such a type of prior and specific choice of $\lambda_n$ comes from the fact that the
corresponding MAP testing procedure mimics the FDR controlling procedures
of Benjamini and Hochberg [7] and Sarkar [23] (Abramovich and Angelini
[1]). In particular, Abramovich and Angelini ([1], Lemma 2) showed that,
almost surely, $k = \lambda_n(1 + o(1))$ or, more precisely, $|k - \lambda_n| \le \sqrt{c_7 n\ln n}$, where
$c_7 > 4$.

The adaptivity results of Theorem 6 are somewhat irrelevant for such a
narrow range of possible $k$ since the "equilibrium point" $k_n^* = \lambda_n(1 + o(1)) =
o(n)$ in Theorems 3-5 becomes essentially known. To apply Theorems 3-5,
we need the following lemma.

Lemma 1. Consider the reflected truncated Poisson prior $\pi_n(k)$ with
$\lambda_n = o(n)$, $\lambda_n/\sqrt{n\ln n} \to \infty$ as $n \to \infty$. For $k = \lambda_n(1 + o(1))$, there exists
$c > 0$ such that $\pi_n(k) \ge (k/n)^{ck}$.

Based on the results of Lemma 1, we can identify the radius $\eta_0$ of the balls,
where the resulting MAP estimator $\hat\mu^*$ is optimal. For $l_0$-balls, Theorem 3
yields $\eta_0 = (\lambda_n/n)(1 + o(1))$. Similarly, for $0 < p < 2$, applying Theorem 4
and Theorem 5, we obtain the result that the corresponding $\eta_0$ satisfies
$\eta_0^p(\ln\eta_0^{-p})^{-p/2} = (\lambda_n/n)(1 + o(1))$.

We now show that there is no Bayesian paradox in this case and a vector
$\mu$ generated from $\pi_n(k)$ falls with high probability within the corresponding
balls of radius $\eta_0$. For $l_0$-balls, it follows since, almost surely,
$k = \lambda_n(1 + o(1)) \sim n\eta_0$. For $l_p$-balls, $0 < p < 2$, Abramovich and Angelini [1] proved that
for the reflected truncated Poisson prior (5.24) we have $Ek = \lambda_n(1 + o(1))$.
Then, Markov's inequality implies

$$P(\|\mu\|_p^p > n\eta_0^p) \le \frac{E\|\mu\|_p^p}{n\eta_0^p} = \frac{\tau^p\nu_p Ek}{n\eta_0^p} = \tau^p\nu_p(\ln\eta_0^{-p})^{-p/2}(1 + o(1)) \to 0,$$

where $\nu_p$ is the $p$th absolute moment of the standard normal distribution.

Table 1
AMSE of various thresholding estimators

                      $\xi$ = 0.5%                 $\xi$ = 5%                  $\xi$ = 50%
$\tau$              3       5       7        3       5       7        3       5       7
Bin            0.0192  0.0194  0.0172   0.1496  0.1372  0.1245   0.8929  0.8196  0.7796
Pois1          0.0192  0.0194  0.0176   0.1497  0.1374  0.1246   0.9157  0.8245  0.7780
Pois2          0.0187  0.0195  0.0173   0.1564  0.1389  0.1256   0.9687  0.8271  0.7795
EbayesThresh   0.0194  0.0189  0.0181   0.1556  0.1379  0.1248   1.8942  1.3473  0.8450
universal      0.1012  0.0998  0.0990   0.1694  0.1600  0.1495   1.9447  2.5474  2.7720

6. Some simulation results. A short simulation study was carried out to 
investigate the performance of several MAP estimators. 

The data were generated according to the model (1.1) with the sample size
$n = 1000$. In $\xi\%$ of cases $\mu_i$ were randomly sampled from $N(0, \tau^2)$,
and otherwise $\mu_i = 0$. The parameter $\xi$ controls the sparsity of the true
signal $\mu$, while $\tau$ reflects its energy. We considered $\xi = 0.5\%$, $5\%$ and $50\%$
corresponding, respectively, to super-sparse, sparse and dense cases, and
$\tau = 3, 5, 7$. For each combination of values of $\xi$ and $\tau$, the number of
replications was 100. The true values of $\sigma$, $\tau$ and $\xi$ were assumed unknown
in simulations and were estimated from the data by the EM-algorithm of
Abramovich and Angelini [1]. Our simulation study also confirmed the
efficiency of their parameter estimation procedure.

We tried three MAP estimators corresponding to the priors considered
in Section 5, namely, binomial $B(n, \xi)$ (Bin), truncated Poisson (Pois1) and
reflected truncated Poisson (Pois2) with A = In addition, we compared 
performances of the above listed MAP estimators with the universal thresh- 
olding of Donoho and Johnstone [9] and the hard thresholding EbayesThresh 
estimator of Johnstone and Silverman [22] with a Cauchy prior. 

Table 1 summarizes mean squared errors averaged over 100 replications 
(AMSE) of various methods. Standard errors in all cases were of order several 
percent of the corresponding AMSE. 
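
The simulation design can be sketched as follows for the universal thresholding baseline only. This is a minimal illustration under stated assumptions, not the code used for Table 1: it takes $\sigma = 1$ as known (the study estimated it), fixes exactly round($n\xi$) nonzero means, applies the hard-thresholding form of the universal rule with threshold $\sigma\sqrt{2\ln n}$, and normalizes the squared error per coordinate. All function names are hypothetical.

```python
import math
import random

def simulate_sparse_means(n, xi, tau, sigma, rng):
    """Draw (mu, y) from model (1.1); round(n * xi) means are N(0, tau^2), the rest are 0."""
    mu = [0.0] * n
    for i in rng.sample(range(n), round(n * xi)):
        mu[i] = rng.gauss(0.0, tau)
    y = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
    return mu, y

def hard_threshold(y, t):
    """Keep observations whose absolute value exceeds t, set the others to zero."""
    return [v if abs(v) > t else 0.0 for v in y]

def amse_universal(n, xi, tau, sigma=1.0, reps=100, seed=0):
    """Per-coordinate squared error of universal hard thresholding, averaged over reps."""
    rng = random.Random(seed)
    t = sigma * math.sqrt(2.0 * math.log(n))  # universal threshold
    total = 0.0
    for _ in range(reps):
        mu, y = simulate_sparse_means(n, xi, tau, sigma, rng)
        est = hard_threshold(y, t)
        total += sum((e - m) ** 2 for e, m in zip(est, mu)) / n
    return total / reps

print(amse_universal(n=1000, xi=0.05, tau=5.0))
```

Because of the assumptions above, the printed value need not reproduce the corresponding entry of Table 1.
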

The performance of all methods naturally improves as $\tau$ increases. As
is typical for any thresholding procedure, all of them are less efficient for
dense cases. Nonadaptive universal thresholding consistently yields the worst
results. All MAP estimators behave similarly, indicating the robustness of
the MAP testimation to the choice of the prior $\pi_n(\cdot)$. They are comparable
with EbayesThresh for very sparse and sparse cases, but strongly outperform
the latter for dense signals. Partially this is explained by the poor behavior
of the MAD estimate of $\sigma$ used in this case by EbayesThresh. However,
even after substituting the true $\sigma$ in EbayesThresh, MAP estimators still
remained preferable.

7. Concluding remarks. In this paper we have considered a Bayesian
approach to a high-dimensional normal means problem. The proposed
hierarchical prior is based on assuming a prior distribution $\pi_n(\cdot)$ on the number
of nonzero entries of the unknown means vector. The resulting Bayesian
MAP "testimator" leads to a hard thresholding rule and, from a frequentist
viewpoint, corresponds to penalized likelihood estimation with a
complexity penalty depending on $\pi_n(\cdot)$. Specific choices of $\pi_n(\cdot)$ lead to several
well-known complexity penalties. In particular, we have discussed the
relationship between MAP testimation and $2k\ln(n/k)$-type penalization
recently considered in a series of papers. We have investigated the optimality
of MAP estimators and established their adaptive minimaxity in various
sparse settings.
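
The penalized form of the MAP testimator can be sketched in a few lines. The sketch is hedged: the penalty is assumed to be $P_n(k) = 2\sigma^2(1 + 1/\gamma)\{\ln[\binom{n}{k}\pi_n^{-1}(k)] + (k/2)\ln(1 + \gamma)\}$, which is the form consistent with $P_n(0)$ and with the risk bound quoted in the proof of Theorem 3, while the defining equation (2.10) is not reproduced in this part of the paper. The selected observations are kept unchanged (hard thresholding), the truncated Poisson prior is used as an example, and all names and numerical values are hypothetical.

```python
import math

def log_binom(n, k):
    """ln C(n, k) via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_truncated_poisson(lam, n):
    """ln pi_n(k), k = 0..n, for the truncated Poisson prior."""
    log_w = [k * math.log(lam) - math.lgamma(k + 1) for k in range(n + 1)]
    m = max(log_w)
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_w))
    return [v - log_norm for v in log_w]

def map_testimator(y, sigma, gamma, log_prior):
    """Minimize sum_{i>k} y_(i)^2 + P_n(k) over k, then keep the k largest |y_i|."""
    n = len(y)
    order = sorted(range(n), key=lambda i: -abs(y[i]))
    # tail[k] = sum of squares of all but the k largest observations
    tail = [0.0] * (n + 1)
    for k in range(n - 1, -1, -1):
        tail[k] = tail[k + 1] + y[order[k]] ** 2

    def penalty(k):  # assumed penalty form, see the lead-in
        return 2 * sigma ** 2 * (1 + 1 / gamma) * (
            log_binom(n, k) - log_prior[k] + 0.5 * k * math.log(1 + gamma))

    k_hat = min(range(n + 1), key=lambda k: tail[k] + penalty(k))
    keep = set(order[:k_hat])
    return [y[i] if i in keep else 0.0 for i in range(n)], k_hat

# Toy data: 48 small observations and two large ones (illustrative only).
y = [0.1 if i % 2 == 0 else -0.1 for i in range(48)] + [9.0, -8.0]
estimate, k_hat = map_testimator(y, sigma=1.0, gamma=10.0,
                                 log_prior=log_truncated_poisson(1.0, len(y)))
print(k_hat, [v for v in estimate if v != 0.0])
```

Swapping `log_truncated_poisson` for the log-probabilities of another prior on the number of nonzero entries changes only the penalty, which is the sense in which the choice of $\pi_n(\cdot)$ determines the thresholding rule.
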

In practice, the unknown parameters of the prior and the noise variance
can be efficiently estimated by the EM algorithm. The simulation study
presented illustrates the theoretical results and shows the robustness properties
of the MAP testimation procedure to the choice of $\pi_n(\cdot)$.

We believe that the proposed Bayesian approach for recovering a high- 
dimensional vector from white noise can be extended to various non-Gaussian 
settings and model selection problems, although appropriate adjustments are 
needed for each specific problem at hand. 



A.1. Proof of Theorem 1. We show that, for a prior satisfying
Assumption (A), the corresponding penalty $P_n(k)$ in (2.10) belongs to the general
class of penalties considered in Birgé and Massart [8]. In particular, we
verify that it satisfies conditions (3.3) and (3.4) of their Theorem 2 and then
use it directly to obtain the upper bound in (3.13).

In our notation the conditions (3.3) and (3.4) of Birgé and Massart [8]
correspond respectively to

(A.1) $$\sum_{k=1}^n \binom{n}{k} e^{-kL_{k,n}} < \infty$$

and

(A.2) $$(1 + 1/\gamma)(2L_{k,n} + \ln(1 + \gamma)) \ge c\,(1 + \sqrt{2L_{k,n}})^2, \qquad k = 1, \ldots, n,$$

for some $c > 1$, where the weights $L_{k,n}$ were defined in (3.15). In fact, Birgé
and Massart [8] require their (3.4) for $k = 0$ as well. However, note that
$P_n(0) = 2\sigma^2(1 + 1/\gamma)\ln\pi_n(0)^{-1} > 0$ and, hence, this condition always holds
for $k = 0$.

The condition (A.1) follows immediately from the definition of $L_{k,n}$,

$$\sum_{k=1}^n \binom{n}{k} e^{-kL_{k,n}} = 1 - \pi_n(0) < \infty.$$



We now turn to (A.2). Consider $k \ge 1$. Let $t = \sqrt{L_{k,n}}$. The condition (A.2)
is then equivalent to the quadratic inequality

(A.3) $$2(1 + 1/\gamma - c)t^2 - 2\sqrt{2}\,c\,t + (1 + 1/\gamma)\ln(1 + \gamma) - c \ge 0.$$

We now find $c > 1$, for which (A.3) holds for all $t$ such that the corresponding
$L_{k,n}$ satisfy Assumption (A). For the discriminant $\Delta$ of (A.3), one has

$$\Delta/4 = 2c^2 - 2(1 + 1/\gamma - c)((1 + 1/\gamma)\ln(1 + \gamma) - c)$$
$$= 2(1 + 1/\gamma)(c(\ln(1 + \gamma) + 1) - (1 + 1/\gamma)\ln(1 + \gamma)).$$

Note that $\ln(1 + \gamma) < \gamma$ and, therefore, $\ln(1 + \gamma)(1 + 1/\gamma) < \ln(1 + \gamma) + 1$.
Hence, $\Delta > 0$ for any $c > 1$. If, in addition, $c < 1 + 1/\gamma$, then (A.3) holds for
all $t \ge t^*$, where $t^*$ is the largest root of the quadratic polynomial on the
left-hand side of (A.3),

(A.4) $$t^* = \frac{c + \sqrt{1 + 1/\gamma}\,\sqrt{c(\ln(1 + \gamma) + 1) - (1 + 1/\gamma)\ln(1 + \gamma)}}{\sqrt{2}\,(1 + 1/\gamma - c)} < \frac{c + 1 + 1/\gamma}{\sqrt{2}\,(1 + 1/\gamma - c)}.$$

Setting $c = 1 + 1/(2\gamma)$, from (A.4) one has $t^* < 2\sqrt{2}(\gamma + 3/4)$. On the other
hand, Assumption (A) implies $L_{k,n} \ge c(\gamma) = 8(\gamma + 3/4)^2$ and, therefore,
$t = \sqrt{L_{k,n}} \ge 2\sqrt{2}(\gamma + 3/4) > t^*$. Thus, $c = 1 + 1/(2\gamma)$ guarantees the
condition (A.3) and the equivalent original condition (A.2).
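
The choice $c = 1 + 1/(2\gamma)$ can be sanity-checked numerically. The sketch below evaluates the quadratic in (A.3) and the largest root from (A.4) on a small grid of $\gamma$ values; the grid and the function names are illustrative assumptions, not part of the proof.

```python
import math

def quadratic_a3(t, gamma, c):
    """Left-hand side of (A.3)."""
    a = 1 + 1 / gamma
    return (2 * (a - c) * t ** 2 - 2 * math.sqrt(2) * c * t
            + a * math.log(1 + gamma) - c)

def largest_root(gamma, c):
    """Largest root t* of the quadratic in (A.3), as given in (A.4)."""
    a = 1 + 1 / gamma
    inner = c * (math.log(1 + gamma) + 1) - a * math.log(1 + gamma)
    return (c + math.sqrt(a) * math.sqrt(inner)) / (math.sqrt(2) * (a - c))

results = []
for gamma in (0.1, 1.0, 10.0, 100.0):      # illustrative grid (assumption)
    c = 1 + 1 / (2 * gamma)
    t_star = largest_root(gamma, c)
    t_min = 2 * math.sqrt(2) * (gamma + 0.75)   # sqrt of c(gamma) = 8 (gamma + 3/4)^2
    results.append((gamma, t_star, t_min, quadratic_a3(t_min, gamma, c)))

for row in results:
    print(row)
```

In every row the root $t^*$ lies below $2\sqrt{2}(\gamma + 3/4)$ and the quadratic is nonnegative at that point, as the argument requires.
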

A.2. Proof of Theorem 2. Consider first $k \ge 1$. For this case
Assumption (A) implies $L_{k,n} \ge c(\gamma) > c(0) > 1$. From Theorem 1, we then have

p(A»<c ( 7 )(l + l/7) 

X X M ( E ^)+^ 2 ^(2^,n + ln (l + 7))|+ci(7V 2 
<co( 7 )(l + l/7)(2^ + ln(l + 7)) 

r 2 



(A.5) x l ^{ n \ ^ 4i) + k(j2 



,i=k+l 



<c 2 ( 7 )(2L; + ln(l + 7 )) M E /xfo + fer 2 +a 2 

I - ~ n \i=fc+l / 



On the other hand, Theorem 1 implies 

p{pr,n) < co( 7 ){][> 2 + 2ff2 (! + V7) Ihtt-^O)! + Cl (j)a 2 

< co (7) I E + ct2 (! + V7)(io,n + 0.5 ln(l + 7)) 1 + c x ( 7 )a 2 . 



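Define $\tilde c_0(\gamma) = c_0(\gamma)/\ln(1 + \gamma)$ and $\tilde c_1(\gamma) = 2c_1(\gamma)/\ln(1 + \gamma)$. Obviously,

$$\tilde c_0(\gamma)(2L_{0,n} + \ln(1 + \gamma)) \ge c_0(\gamma)$$

and

$$\tilde c_1(\gamma)(2L_{0,n} + \ln(1 + \gamma)) \ge 2c_1(\gamma).$$

Hence,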



(A.6) 



p{F,p) < c (7)(2^o,n + ln(l + 7)) E 3 

i=l 

+ c ( 7 )(l + 1/7)(2L ,„ + ln(l + 7 )) C t 2 /2 
+ c 1 ( 7 )(2L , n + ln(l+ 7 ))a 2 /2 

< C2 ( 7 )(2L; + ln(l+ 7 ))|E^ + a 2 |. 
Combining (A.5) and (A.6), we have 

p(A^)<c 2 ( 7 )(2L; + ln(l+ 7 )){ inf ( E 4) + ^ ) + 



0<fc<n' 



4=k+l 



= c 2 ( 7 )(2L; + ln(l + 7 )) \ E min( M ?, a 2 ) + a 2 }> , 
which completes the proof. 



i=l 
A.3. Proof of Theorem 3. We start with Lemma A.1, which will be used
throughout the following proofs.

Lemma A.1.

1. $\ln\binom{n}{k} \ge k\ln(n/k)$, $k = 1, \ldots, n$.

2. Let $n/k \to \infty$ as $n \to \infty$. Then for any constant $c > 1$ for sufficiently
large $n$, $\ln\binom{n}{k} \le ck\ln(n/k)$.

Proof. The first statement of Lemma A.1 follows immediately from the
trivial inequality $\binom{n}{k} \ge (n/k)^k$.

To prove the second statement, note that, using Stirling's formula, one has

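(A.7) $$\binom{n}{k} \le \left(\frac{n}{e}\right)^n\left(\frac{e}{n-k}\right)^{n-k}\left(\frac{e}{k}\right)^k = \left(\frac{n}{k}\right)^k\left(\frac{n}{n-k}\right)^{n-k}.$$

Since $(1 - k/n)^{-n/k} \to e$ as $n/k \to \infty$, for any $c > 1$ for sufficiently large $n$,

$$\left(\frac{n}{n-k}\right)^{n-k} = \left((1 - k/n)^{-n/k}\right)^{k(1 - k/n)} \le \left(\frac{n}{k}\right)^{(c-1)k}.$$

Thus, from (A.7), $\binom{n}{k} \le (n/k)^{ck}$ for sufficiently large $n$. □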
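
Both statements of Lemma A.1 are easy to examine numerically. The sketch below checks the lower bound for every $k$ at one value of $n$ and tracks the ratio $\ln\binom{n}{k}/(k\ln(n/k))$ as $n/k$ grows; the particular values of $n$ and $k$ are illustrative assumptions.

```python
import math

def log_binom(n, k):
    """ln C(n, k), computed from the exact integer binomial coefficient."""
    return math.log(math.comb(n, k))

# Statement 1: ln C(n, k) >= k ln(n / k) for k = 1..n (small slack for rounding).
n = 1000
statement_1 = all(
    log_binom(n, k) >= k * math.log(n / k) - 1e-9 for k in range(1, n + 1)
)

# Statement 2: for fixed k the ratio approaches 1 from above as n / k grows.
k = 10
ratios = [log_binom(m, k) / (k * math.log(m / k)) for m in (10**3, 10**6, 10**9)]

print(statement_1, ratios)
```

The ratios decrease toward 1, so any fixed constant $c > 1$ eventually dominates, which is the content of the second statement.
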



Now we return to the proof of Theorem 3. Evidently, for any $\mu \in l_0[\eta]$,
$\mu_{(i)} = 0$, $i > k^* = n\eta$. Since $k^* = o(n)$, from the general upper bound for the
risk established in Corollary 1, it follows that

$$E\|\hat\mu^* - \mu\|^2 \le c_0(\gamma)\,2\sigma^2(1 + 1/\gamma)\left(\ln\left\{\binom{n}{n\eta}\pi_n^{-1}(n\eta)\right\} + \frac{n\eta}{2}\ln(1 + \gamma)\right) + c_1(\gamma)\sigma^2.$$

From Lemma A.1

$$\ln\left\{\binom{n}{n\eta}\pi_n^{-1}(n\eta)\right\} \ge \ln\binom{n}{n\eta} \ge n\eta\ln\eta^{-1} \ge n\eta\ln(1 + \gamma)$$

when $\eta \to 0$ as $n \to \infty$. On the other hand, under the conditions of Theorem
3, Lemma A.1 implies

$$\ln\left\{\binom{n}{n\eta}\pi_n^{-1}(n\eta)\right\} \le c\,n\eta\ln\eta^{-1}$$

for sufficiently large $n$. Summarizing, one has $E\|\hat\mu^* - \mu\|^2 \le c_2(\gamma)\sigma^2 n\eta\ln\eta^{-1}$.
A.4. Proof of Theorem 4. Define a "least-favorable" sequence
$\mu_{0i} = \eta(n/i)^{1/p}$, $i = 1, \ldots, n$, that maximizes $\sum_{i=k+1}^n \mu_{(i)}^2$ over $\mu \in m_p[\eta]$ for
any $k = 0, \ldots, n-1$. For $k \ge 1$,

(A.8) $$\sum_{i=k+1}^n \mu_{0i}^2 \le \eta^2 n^{2/p}\int_k^\infty x^{-2/p}\,dx = \frac{p}{2-p}\,\eta^2 n^{2/p} k^{1 - 2/p},$$

while, for $k = 0$,

$$\sum_{i=1}^n \mu_{0i}^2 < \eta^2 n^{2/p}\zeta(2/p),$$

where $\zeta(\cdot) < \infty$ is the Riemann Zeta-function.
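
The two bounds on the tail energy of the least-favorable sequence can be checked directly. The sketch below takes $p = 1$, for which $\zeta(2/p) = \zeta(2) = \pi^2/6$ is available in closed form; the values of $n$ and $\eta$ and the function names are illustrative assumptions.

```python
import math

def tail_energy(n, eta, p, k):
    """sum_{i=k+1}^n mu_{0i}^2 for mu_{0i} = eta * (n / i)^(1/p)."""
    return sum((eta * (n / i) ** (1.0 / p)) ** 2 for i in range(k + 1, n + 1))

def integral_bound(n, eta, p, k):
    """Right-hand side of (A.8): p / (2 - p) * eta^2 * n^(2/p) * k^(1 - 2/p), k >= 1."""
    return p / (2 - p) * eta ** 2 * n ** (2.0 / p) * k ** (1 - 2.0 / p)

n, eta, p = 1000, 0.01, 1.0   # illustrative values (assumption)

# (A.8) for every k = 1..n-1
a8_holds = all(
    tail_energy(n, eta, p, k) <= integral_bound(n, eta, p, k)
    for k in range(1, n)
)

# k = 0 case, using zeta(2) = pi^2 / 6 since 2 / p = 2 here
zeta_bound = eta ** 2 * n ** (2.0 / p) * math.pi ** 2 / 6
k0_holds = tail_energy(n, eta, p, 0) < zeta_bound

print(a8_holds, k0_holds)
```
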

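1. $n^{1/p}\eta \ge \sqrt{2\ln n}$ (sparse case).
In this case, $1 \le k_n^* = o(n)$ and from Corollary 1, Lemma A.1 and (A.8), one
has

$$E\|\hat\mu^* - \mu\|^2 \le c_0(\gamma)\left\{\sum_{i=k_n^*+1}^n \mu_{0i}^2 + 2\sigma^2(1 + 1/\gamma)\left(\ln\left\{\binom{n}{k_n^*}\pi_n^{-1}(k_n^*)\right\} + \frac{k_n^*}{2}\ln(1 + \gamma)\right)\right\} + c_1(\gamma)\sigma^2$$
$$\le c_0(\gamma)\left\{\frac{p}{2-p}\,\eta^2 n^{2/p}(k_n^*)^{1 - 2/p} + \tilde c_1(\gamma)\sigma^2\left(k_n^*\ln(n/k_n^*) + \ln\pi_n^{-1}(k_n^*)\right)\right\}.$$

To complete the proof for this case, note that $\eta^2 n^{2/p}(k_n^*)^{1 - 2/p} = n\eta^p(\ln\eta^{-p})^{1 - p/2}$
and under the conditions on $\pi_n(k_n^*)$ of Theorem 4,

$$k_n^*\ln(n/k_n^*) + \ln\pi_n^{-1}(k_n^*) \le (c_p + 1)k_n^*\ln(\eta^{-p}(\ln\eta^{-p})^{p/2})$$
$$= O(n\eta^p(\ln\eta^{-p})^{1 - p/2}).$$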

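2. $n^{1/p}\eta < \sqrt{2\ln n}$ (super-sparse case).
In this case Corollary 1 and conditions on $\pi_n(0)$ imply

$$E\|\hat\mu^* - \mu\|^2 \le c_0(\gamma)\left\{\sum_{i=1}^n \mu_{0i}^2 + 2\sigma^2(1 + 1/\gamma)\ln\pi_n^{-1}(0)\right\} + c_1(\gamma)\sigma^2$$
$$\le c_0(\gamma)\left\{\eta^2 n^{2/p}\zeta(2/p) + 2\sigma^2(1 + 1/\gamma)c_p\eta^2 n^{2/p}\right\} + c_1(\gamma)\sigma^2$$
$$= O(\eta^2 n^{2/p}).$$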
A.5. Proof of Theorem 5. First, we find the "least favorable" sequence
$\mu_0$ that maximizes $\sum_{i=k+1}^n \mu_{(i)}^2$ over $\mu \in l_p[\eta]$ for a given $k = 0, \ldots, n-1$.
Applying Lagrange multipliers, after some algebra one has

Tl / r) \ lip 

for $k \ge 1$ and

n 
i=l 

for k = 0. The rest of the proof therefore repeats the proof of Theorem 4. 

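A.6. Proof of Proposition 1. First, note that $\pi_n(1) = n\xi_n(1 - \xi_n)^{n-1}$ and
the condition $\pi_n(1) \ge n^{-c}$ in Theorem 6 requires that $\xi_n \to 0$ as $n \to \infty$. In
particular, it implies

$$(1 - \xi_n)^n = ((1 - \xi_n)^{1/\xi_n})^{n\xi_n} = \exp\{-n\xi_n(1 + o(1))\}$$

and, using Lemma A.1, we then have

$$\pi_n(k) \ge \left(\frac{n\xi_n}{k}\right)^k(1 - \xi_n)^n = \left(\frac{n\xi_n}{k}\exp\left\{-\frac{n\xi_n}{k}(1 + o(1))\right\}\right)^k.$$

To satisfy $\pi_n(k) \ge (k/n)^{ck}$, it is sufficient to have

(A.9) $$\frac{n\xi_n}{k}(1 + o(1)) - \ln\left(\frac{n\xi_n}{k}\right) \le c\ln\left(\frac{n}{k}\right).$$

Recall that $n^{-c_2} \le \xi_n \le c_1(\ln n)/n$, where $c_1 > 0$ and $c_2 > 1$. Define
$\kappa_n = n\xi_n^{c_3}$, where $0 < c_3 < 1/c_2$. Obviously, $\kappa_n/n \to 0$, $\kappa_n \ge n^{1 - c_2 c_3}$ with $1 - c_2 c_3 > 0$,
and, therefore, $(\ln n)/\kappa_n \to 0$ as $n \to \infty$.

For $1 \le k \le n\xi_n$, using the monotonicity of the function $-x\ln x$ for $x <
1/e$, we have

$$\frac{k}{n}\ln\left(\frac{n}{k}\right) \ge \frac{\ln n}{n} \ge c_1^{-1}\xi_n$$

and, therefore,

$$\frac{n\xi_n}{k}(1 + o(1)) - \ln\left(\frac{n\xi_n}{k}\right) \le \frac{n\xi_n}{k}(1 + o(1)) \le c_1(1 + o(1))\ln\left(\frac{n}{k}\right),$$

which yields (A.9).

On the other hand, for all $n\xi_n < k \le \kappa_n$, we have $\xi_n \ge (k/n)^{1/c_3} = (k/n)^{1 + \tilde c}$,
where $\tilde c > 0$, which yields $(n\xi_n/k) \ge (k/n)^{\tilde c}$. Thus,

$$\frac{n\xi_n}{k}(1 + o(1)) - \ln\left(\frac{n\xi_n}{k}\right) \le (1 + o(1)) + \tilde c\ln\left(\frac{n}{k}\right)$$

and (A.9) holds.

Applying Theorem 6 for $\kappa_n = n\xi_n^{c_3}$ completes the proof.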

A.7. Proof of Proposition 2. Define $\kappa_n = n(\lambda_n/n)^{c_3}$, where $0 < c_3 <
1/(1 + c_2)$. Obviously, $\kappa_n \ge n^{1 - (1 + c_2)c_3} = n^{\delta}$, where $0 < \delta < 1$ and,
therefore, $(\ln n)/\kappa_n \to 0$ and $\kappa_n/n = (\lambda_n/n)^{c_3} \le (c_1\ln n/n)^{c_3} \to 0$ as $n \to \infty$.

For $k = 1$, we have shown in Section 5.2 that $\pi_n(1) \ge n^{-(c_1 + c_2)}$. For $k \ge 2$,
exploit positivity of the function $k - 1/(12k) - (1/2)\ln(2\pi k)$ to obtain from
(5.21)

$$\ln\pi_n(k) \ge k\ln\left(\frac{\lambda_n}{k}\right) - \lambda_n.$$

The rest of the proof essentially repeats the proof of Proposition 1 starting
from (A.9) with $\lambda_n = n\xi_n$ and without $o(1)$.



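A.8. Proof of Lemma 1. Applying Stirling's formula for large $\lambda_n$ and
$k = \lambda_n(1 + o(1))$, after simple calculation, one has

$$\ln\pi_n(k) \ge (n - k)\ln\left(\frac{n - \lambda_n}{n - k}\right) + \lambda_n - k - \frac{1}{12(n - k)} - \frac{1}{2}\ln(2\pi(n - k))$$
$$= (n - \lambda_n - o(\lambda_n))\ln\left(1 + \frac{o(\lambda_n)}{n - \lambda_n - o(\lambda_n)}\right) + o(\lambda_n) - \frac{1}{2}\ln(n - \lambda_n - o(\lambda_n))$$
$$= o(\lambda_n) - \frac{1}{2}\ln(n - \lambda_n) - \frac{1}{2}\ln\left(1 - \frac{o(\lambda_n)}{n - \lambda_n}\right)$$
$$= o(\lambda_n) - \frac{1}{2}\ln(n - \lambda_n).$$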

On the other hand, $ck\ln(k/n) = c\lambda_n(1 + o(1))\ln(\lambda_n/n) + o(\lambda_n)$. Thus, to
prove Lemma 1, it is sufficient to show that

(A.10) $$\frac{1}{2}\ln(n - \lambda_n) \le c\lambda_n\ln(n/\lambda_n)$$

for some $c > 0$.

Denote $g_1(\lambda_n) = \frac{1}{2}\ln(n - \lambda_n)$ and $g_2(\lambda_n) = \lambda_n\ln(n/\lambda_n)$. Note that $g_1(\lambda_n)$
decreases while $g_2(\lambda_n)$ increases for $\lambda_n < n/e$, and $g_1(1) < g_2(1)$. Then, for
any $c \ge 1$, one has $cg_2(\lambda_n) \ge cg_2(1) > g_1(1) \ge g_1(\lambda_n)$, which proves (A.10).